33 research outputs found

    Distributed multinomial regression

    Full text link
    This article introduces a model-based approach to distributed computing for multinomial logistic (softmax) regression. We treat counts for each response category as independent Poisson regressions via plug-in estimates for fixed effects shared across categories. The work is driven by the high-dimensional-response multinomial models that are used in analysis of a large number of random counts. Our motivating applications are in text analysis, where documents are tokenized and the token counts are modeled as arising from a multinomial dependent upon document attributes. We estimate such models for a publicly available data set of reviews from Yelp, with text regressed onto a large set of explanatory variables (user, business, and rating information). The fitted models serve as a basis for exploring the connection between words and variables of interest, for reducing dimension into supervised factor scores, and for prediction. We argue that the approach herein provides an attractive option for social scientists and other text analysts who wish to bring familiar regression tools to bear on text data.Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Multinomial Inverse Regression for Text Analysis

    Full text link
    Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. It is also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-preserving dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto document annotations can be used to obtain low dimension document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with very high-dimension response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for maximization of the joint posterior over coefficients and their prior scale. This "gamma-lasso" scheme yields stable and effective estimation for general high-dimension logistic regression, and we argue that it will be superior to current methods in many settings. Guidelines for prior specification are provided, algorithm convergence is detailed, and estimator properties are outlined from the perspective of the literature on non-concave likelihood penalization. Related work on sentiment analysis from statistics, econometrics, and machine learning is surveyed and connected. Finally, the methods are applied in two detailed examples and we provide out-of-sample prediction studies to illustrate their effectiveness.Comment: Published in the Journal of the American Statistical Association 108, 2013, with discussion (rejoinder is here: http://arxiv.org/abs/1304.4200). Software is available in the textir package for

    One-step estimator paths for concave regularization

    Full text link
    The statistics literature of the past 15 years has established many favorable properties for sparse diminishing-bias regularization: techniques which can roughly be understood as providing estimation under penalty functions spanning the range of concavity between L0L_0 and L1L_1 norms. However, lasso L1L_1-regularized estimation remains the standard tool for industrial `Big Data' applications because of its minimal computational cost and the presence of easy-to-apply rules for penalty selection. In response, this article proposes a simple new algorithm framework that requires no more computation than a lasso path: the path of one-step estimators (POSE) does L1L_1 penalized regression estimation on a grid of decreasing penalties, but adapts coefficient-specific weights to decrease as a function of the coefficient estimated in the previous path step. This provides sparse diminishing-bias regularization at no extra cost over the fastest lasso algorithms. Moreover, our `gamma lasso' implementation of POSE is accompanied by a reliable heuristic for the fit degrees of freedom, so that standard information criteria can be applied in penalty selection. We also provide novel results on the distance between weighted-L1L_1 and L0L_0 penalized predictors; this allows us to build intuition about POSE and other diminishing-bias regularization schemes. The methods and results are illustrated in extensive simulations and in application of logistic regression to evaluating the performance of hockey players.Comment: Data and code are in the gamlr package for R. Supplemental appendix is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd

    Estimation and Inference about Heterogeneous Treatment Effects in High-Dimensional Dynamic Panels

    Full text link
    This paper provides estimation and inference methods for a large number of heterogeneous treatment effects in a panel data setting with many potential controls. We assume that heterogeneous treatment is the result of a low-dimensional base treatment interacting with many heterogeneity-relevant controls, but only a small number of these interactions have a non-zero heterogeneous effect relative to the average. The method has two stages. First, we use modern machine learning techniques to estimate the expectation functions of the outcome and base treatment given controls and take the residuals of each variable. Second, we estimate the treatment effect by l1-regularized regression (i.e., Lasso) of the outcome residuals on the base treatment residuals interacted with the controls. We debias this estimator to conduct pointwise inference about a single coefficient of treatment effect vector and simultaneous inference about the whole vector. To account for the unobserved unit effects inherent in panel data, we use an extension of correlated random effects approach of Mundlak (1978) and Chamberlain (1982) to a high-dimensional setting. As an empirical application, we estimate a large number of heterogeneous demand elasticities based on a novel dataset from a major European food distributor
    corecore